conflicts: escape conflict markers by making them longer #4969

scott2000 · 2024-11-25T14:15:10Z

Related to #3975 and #4088.

Checklist

If applicable:

I have updated CHANGELOG.md
I have updated the documentation (README.md, docs/, demos/)
I have updated the config schema (cli/src/config-schema.json)
I have added tests to cover my changes

joyously · 2024-11-25T20:23:44Z

Perhaps this should be tested by correctly parsing its own source code.

scott2000 · 2024-11-25T21:02:35Z

Perhaps this should be tested by correctly parsing its own source code.

Surprisingly, the source code for conflicts.rs doesn't actually contain any lines that look like conflict markers! The only two files that have lines which can be confused for conflict markers currently are tutorial.md and conflicts.md it looks like.

lib/src/conflicts.rs

yuja · 2024-11-27T01:27:58Z

lib/src/conflicts.rs

+    debug_assert!(expected_len >= MIN_CONFLICT_MARKER_LEN);
+
+    parse_conflict_marker_any_len(line)
+        .filter(|marker| marker.len == expected_len)


Maybe we need to accept marker.len >= expected_len?

Suppose working-copy content was materialized with N = 10, and user removed a literal "<<<<<<<<<" (N = 9) without resolving any other conflicts, the new expected_len would drop. iirc, the working-copy content isn't updated to use the new expected_len.

That's what I had originally, but I think it's too permissive actually. If we just had a minimum, and didn't enforce an exact number, then this could happen:

Conflict materialized with length 12 markers since file contains length 8 marker

User deletes length 8 marker

Conflict is parsed expecting length >=12 markers successfully

User adds length 8 marker back, since they made a mistake while editing

Conflict is parsed expecting length >=7 markers now, so both the length 8 and length 12 markers are parsed, even though the length 8 marker isn't a real marker

To solve this, I made it so that only the longest length markers in a file are parsed. So the example would look like this:

Conflict materialized with length 12 markers since file contains length 8 marker

User deletes length 8 marker

Conflict is parsed expecting length 12 markers successfully

User adds length 8 marker back, since they made a mistake while editing

Conflict is parsed expecting length 12 markers (since 12 is the longest marker present, and 12 >= 7), so the length 8 markers are correctly ignored

Hmm, we can't know whether the re-added marker is a literal or conflict, right? My feeling is that it's safer to assume that a parsable marker is a conflict rather than invalidating any other conflicts by single addition of marker-looking line.

If we prefer stricter behavior, we'll probably need to store the materialized marker length to TreeState on checkout.

Yeah, that's a good point. Sorry for the long response; I was just trying to think through all the reasonable options there might be and summarize them. I think these are the variables we can use for setting a bound:

R = length of real conflict markers in original materialization (would need to be stored in TreeState)

N = length of longest fake conflict markers in last snapshot's Merge<FileId>

K = increment added to longest fake marker to get materialized marker length

C = length of longest conflict marker (real or fake) in the raw file contents being snapshotted

I think it's important that we have a simple rule that the user can follow while editing a file to ensure that the real conflict markers will continue to be parsed successfully. These are the 4 options I can think of, along with the expressions that we could use to determine if a marker is parsable:

Possible options

Option 1: "Don't add any new fake markers which are longer than any fake markers in the last snapshotted version of the file"

Either len >= max(7, N+K) or len >= max(7, N+1) would work here.

Option 2: "Don't add any new fake markers which are longer than any fake markers in the original materialized file"

I think len == max(7, N+K, C) matches this rule, which is what I currently have implemented. The main difference with this and option 1 is that the user is allowed to add back fake conflict markers that they removed, and they still won't be parsed.

Option 3: "Don't add any new fake markers with length greater than or equal to the markers that jj added"

I think len >= R is the obvious implementation of this rule. It does require adding it to the TreeState though.

I think len == max(7, N+1, C) would also fall into this category, since the user could still add any fake markers that are shorter than the original materialized markers, since that would keep the original materialized markers as being the longest in the file, so it would continue parsing them successfully. This is different from len == max(7, N+K, C), because that rule would cause snapshotting to fail the next time after adding any fake markers of length N+1 through N+K-1, since that would result in the real conflict markers becoming considered shorter than allowed in the following snapshot.

Option 4: "Don't add any new fake markers with the exact same length as the markers that jj added"

This is probably the easiest rule to follow for the user, and it is what we currently have in jj 0.23.0, except that the materialized length is always 7. This could be achieved with len == R, which also requires adding it to the TreeState.

If we don't store the materialized marker length in the TreeState, then I believe adding a fake marker which is longer than the real materialized markers will always cause the real markers to not be parsed correctly. With some implementations it might fail to parse in the current snapshot (because it picked the wrong markers to parse), or with other implementations it might parse successfully in the current snapshot, but then fail in the next snapshot (because the inferred minimum conflict marker length increased as a result of there being new fake conflict markers added), or it could parse successfully but just have the wrong content (since some fake markers were treated as real ones and happened to form a valid conflict).

With that in mind, I feel like these are the 3 best options, each with different tradeoffs (I think this is basically what you were saying, with the first and last options being the safest choices):

len >= max(7, N+1), since it is the simplest to implement, and it handles all of the common cases reasonably well.

len == max(7, N+1, C), since it allows adding any fake markers that are shorter than the real markers, which the user might expect to work.

len == R since it relies on the fewest assumptions (but it requires storing marker length in the TreeState).

I think I'll look into adding it to the TreeState to see how difficult it would be. If it's a small change, then probably that would be the best option?

I think I'll look into adding it to the TreeState to see how difficult it would be. If it's a small change, then probably that would be the best option?

It wouldn't be difficult, but I'm not pretty sure if that's better because we're adding a new persistent state invisible to user. (BTW, the materialized conflict style could be stored in the TreeState if supported styles had ambiguous markers.)

@ilyagr, @martinvonz, any comments?

R = length of real conflict markers in original materialization (would need to be stored in TreeState)

Do we need to store it in the TreeState? Can we not get it from the previous content in the tree object?

Do you mean getting it from the tree of the working-copy parent? It might work, but I'm not sure.

FWIW, I think it makes sense to trust TreeState here because the materialization parameter is kinda working-copy state, not a state of the committed tree.

I updated the PR to first implement parsing if >= materialized_marker_len, and then I added conflict_marker_len to the file state in another commit to see how it could work. If we decide not to add it to the file state, I can always drop the last commit from the PR.

Do you mean getting it from the tree of the working-copy parent? It might work, but I'm not sure.

I meant getting it from the previous snapshot. But it seems like it would not handle the case in test_update_from_content_malformed_conflict anyway (unless we allowed the snapshotting process to make the conflict markers in the working copy longer, but that seems like a really bad idea).

I was thinking that it's good if TreeState is just a cache that can we can discard without impact other than in the corner case where a file has been modified in a way that the cache prevents us from detecting (currently if the mtime and size is preserved but e.g. inode is changed).

I was thinking that it's good if TreeState is just a cache that can we can discard without impact other than in the corner case where a file has been modified in a way that the cache prevents us from detecting (currently if the mtime and size is preserved but e.g. inode is changed).

Currently, I have it set to default to parsing length 7 conflict markers if there's no materialized conflict data in the TreeState, so this should still work in many cases since it will parse all conflict markers of length >= 7 if the TreeState is reset, and the conflict markers should usually be length 7 anyways.

If we want to make it possible to reset the TreeState in more cases, we could have the conflict marker length default to choose_materialized_conflict_marker_len(current_tree_merge) instead maybe? There is a tradeoff though, because doing this could break already-materialized conflicts from older versions of jj if a user upgrades and the new materialized markers are longer than the actual markers in the file (which would be length 7 always in older versions).

Sounds good to keep it simple for now.

lib/src/conflicts.rs

yuja · 2024-11-27T01:39:24Z

lib/src/conflicts.rs

+    // considered longer than necessary the next time the file is snapshotted.
+    let expected_marker_len = materialized_marker_len
+        .unwrap_or_else(|| choose_conflict_marker_len(&merge_hunk))
+        .max(max_content_marker_len);


Should we trust the marker length of the new content? Doesn't it mean any existing conflicts can be invalidated if user added long <<<<<<<<<<<<< literal by mistake?

Yeah, they would be invalidated if the user adds a marker with length greater than or equal to the materialized marker length. I'm not sure we can work around that though, since there will always be some length of marker that's invalid for them to add.

My goal was to make it so that if a file initially had fake markers of length N when it was materialized, then the user can always add/remove as many markers as they want with length <= N without breaking parsing, even if multiple snapshot operations happen during the process. I think it makes sense that adding markers of length > N could interfere with conflict resolution, since it makes it ambiguous whether the marker is a real one or not.

lib/src/conflicts.rs

martinvonz

Sorry, I thought I had already sent these comments last week

martinvonz · 2024-11-25T16:51:11Z

lib/src/conflicts.rs

+/// Represents a kind of conflict marker which can be materialized and parsed.
+#[derive(Clone, Copy, PartialEq, Eq)]
+#[repr(u8)]
+enum ConflictMarkerKind {


nit: How about ConflictMarkerSeparator or ConflictMarkerSeparatorChar since it just represents individual separators rather than the whole marker?

I'm not sure how I feel about "separator". I personally have always thought of each individual line like "<<<<<<<" as being markers, rather than just being a part of a marker. Maybe ConflictMarkerLine and ConflictMarkerLineChar would be clearer? I don't feel strongly though.

I think ConflictMarkerChar, LineChar, LineKind are good, if the original concern was that ConflictMarkerKind sounds like a ConflictMarkerStyle.

martinvonz · 2024-11-28T13:49:16Z

lib/src/conflicts.rs

+    debug_assert!(expected_len >= MIN_CONFLICT_MARKER_LEN);
+
+    parse_conflict_marker_any_len(line)
+        .filter(|marker| marker.len == expected_len)


That's a good point about the invisible state. I think that's enough to convince me that one of the options is better. I'm leaning towards the first choice because it's simplest. But it's a good point about in the second option about presence of markers of different lengths. These would be real markers or markers that we would at least interpret as real. Maybe another option is to use option 1 but additionally require that all markers have the same length, or otherwise interpret the file as resolved. I don't strongly which of these three options we go with.

By the way, I think background snapshotting makes Scott's example at the start of this thread more important because you might save and then undo and save again and get two snapshots, which would then get an additional real conflict. OTOH, background snapshotting already makes it so a save while your editing conflict markers often results the file getting interpreted as resolved instead, so we're not making it much worse by adding this new risk of reinterpreting markers.

lib/src/conflicts.rs

lib/tests/test_conflicts.rs

yuja · 2024-12-08T03:53:29Z

lib/tests/test_conflicts.rs

+    "};
+    // On the first snapshot, it will parse as a conflict containing conflict
+    // markers as text
+    let new_conflict = parse(&conflict, new_conflict_contents.as_bytes());


Indeed. This one is scary, and would probably provide a worse UX than the current state.

We'll have to either remember the materialized marker length or parse unpaired markers into some form of conflict.

cli/src/diff_util.rs

yuja · 2024-12-08T03:58:07Z

lib/src/conflicts.rs

+/// Represents a kind of conflict marker which can be materialized and parsed.
+#[derive(Clone, Copy, PartialEq, Eq)]
+#[repr(u8)]
+enum ConflictMarkerKind {


I think ConflictMarkerChar, LineChar, LineKind are good, if the original concern was that ConflictMarkerKind sounds like a ConflictMarkerStyle.

lib/tests/test_conflicts.rs

martinvonz · 2024-12-08T18:33:26Z

lib/src/conflicts.rs

+    debug_assert!(expected_len >= MIN_CONFLICT_MARKER_LEN);
+
+    parse_conflict_marker_any_len(line)
+        .filter(|marker| marker.len == expected_len)


Do you mean getting it from the tree of the working-copy parent? It might work, but I'm not sure.

I meant getting it from the previous snapshot. But it seems like it would not handle the case in test_update_from_content_malformed_conflict anyway (unless we allowed the snapshotting process to make the conflict markers in the working copy longer, but that seems like a really bad idea).

I was thinking that it's good if TreeState is just a cache that can we can discard without impact other than in the corner case where a file has been modified in a way that the cache prevents us from detecting (currently if the mtime and size is preserved but e.g. inode is changed).

lib/tests/test_conflicts.rs

These changes make the code a bit more readable, and they will make it easier to have conflict markers of different lengths in the next commit.

If a file contains lines which look like conflict markers, then we need to make the real conflict markers longer so that the materialized conflicts can be parsed unambiguously. When parsing the conflict, we require that the conflict markers are at least as long as the materialized conflict markers based on the current tree. This can lead to some unintuitive edge cases which will be solved in the next commit. For instance, if we have a file explaining the differences between Jujutsu's conflict markers and Git's conflict markers, it could produce a conflict with long markers like this: ``` <<<<<<<<<<< Conflict 1 of 1 %%%%%%%%%%% Changes from base to side jj-vcs#1 Jujutsu uses different conflict markers than Git, which just shows the -sides of a conflict without a diff. +sides of a conflict without a diff: + +<<<<<<< +left +||||||| +base +======= +right +>>>>>>> +++++++++++ Contents of side jj-vcs#2 Jujutsu uses different conflict markers than Git: <<<<<<< %%%%%%% -base +left +++++++ right >>>>>>> >>>>>>>>>>> Conflict 1 of 1 ends ``` We should support these options for "git" conflict marker style as well, since Git actually does support producing longer conflict markers in some cases through .gitattributes: https://git-scm.com/docs/gitattributes#_conflict_marker_size We may also want to support passing the conflict marker length to merge tools as well in the future, since Git supports a "%L" parameter to pass the conflict marker length to merge drivers: https://git-scm.com/docs/gitattributes#_defining_a_custom_merge_driver

Storing the conflict marker length in the working copy makes conflict parsing more consistent, and it allows us to parse valid conflict hunks even if the user left some invalid conflict markers in the file while resolving the conflicts.

scott2000 mentioned this pull request Nov 25, 2024

conflicts: remove use of Regex from conflict marker parsing #4968

Closed

4 tasks

scott2000 force-pushed the escape-conflict-markers branch 4 times, most recently from c547024 to d62d14c Compare November 25, 2024 17:29

scott2000 force-pushed the escape-conflict-markers branch from d62d14c to 6446b3d Compare November 26, 2024 15:35

scott2000 marked this pull request as ready for review November 26, 2024 15:46

scott2000 commented Nov 26, 2024

View reviewed changes

lib/src/conflicts.rs Show resolved Hide resolved

scott2000 force-pushed the escape-conflict-markers branch from 6446b3d to 9f0be3b Compare November 26, 2024 20:29

yuja reviewed Nov 27, 2024

View reviewed changes

scott2000 force-pushed the escape-conflict-markers branch from 9f0be3b to cb50ba3 Compare December 3, 2024 01:25

martinvonz reviewed Dec 3, 2024

View reviewed changes

ilyagr mentioned this pull request Dec 3, 2024

conflicts: add "ui.conflict-marker-style" config #4897

Merged

4 tasks

scott2000 force-pushed the escape-conflict-markers branch 5 times, most recently from 8b28066 to 5299c8f Compare December 7, 2024 15:41

yuja reviewed Dec 8, 2024

View reviewed changes

scott2000 force-pushed the escape-conflict-markers branch from 5299c8f to 82cad26 Compare December 8, 2024 16:22

martinvonz reviewed Dec 8, 2024

View reviewed changes

scott2000 force-pushed the escape-conflict-markers branch 4 times, most recently from cf0c834 to b99285a Compare December 17, 2024 01:05

scott2000 added 3 commits December 17, 2024 14:49

conflicts: refactor conflict marker writing and parsing

6ac0514

These changes make the code a bit more readable, and they will make it easier to have conflict markers of different lengths in the next commit.

scott2000 force-pushed the escape-conflict-markers branch from b99285a to f2d6c28 Compare December 17, 2024 21:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

conflicts: escape conflict markers by making them longer #4969

conflicts: escape conflict markers by making them longer #4969

scott2000 commented Nov 25, 2024 •

edited

Loading

joyously commented Nov 25, 2024

scott2000 commented Nov 25, 2024

yuja Nov 27, 2024

scott2000 Nov 27, 2024

yuja Nov 27, 2024

scott2000 Nov 27, 2024 •

edited

Loading

yuja Nov 28, 2024

yuja Dec 6, 2024

scott2000 Dec 7, 2024

martinvonz Dec 8, 2024

scott2000 Dec 16, 2024

martinvonz Dec 16, 2024

yuja Nov 27, 2024

scott2000 Nov 27, 2024 •

edited

Loading

martinvonz left a comment

martinvonz Nov 25, 2024

scott2000 Dec 3, 2024

yuja Dec 8, 2024

martinvonz Nov 28, 2024

yuja Dec 8, 2024

yuja Dec 8, 2024

martinvonz Dec 8, 2024

conflicts: escape conflict markers by making them longer #4969

Are you sure you want to change the base?

conflicts: escape conflict markers by making them longer #4969

Conversation

scott2000 commented Nov 25, 2024 • edited Loading

Checklist

joyously commented Nov 25, 2024

scott2000 commented Nov 25, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

scott2000 Nov 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

scott2000 Nov 27, 2024 • edited Loading

Choose a reason for hiding this comment

martinvonz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

scott2000 commented Nov 25, 2024 •

edited

Loading

scott2000 Nov 27, 2024 •

edited

Loading

scott2000 Nov 27, 2024 •

edited

Loading